fix(timing): warmup pass before timing loop to amortise torch.compile JIT#70

Merged
ivanbasov merged 3 commits into main from fix/timing-compile-warmup on Apr 23, 2026
@ivanbasov
Collaborator

Summary

  • Adds a single warmup forward pass through pipeline_module before the timing loop in run_inference_and_decode_pre_decoder_memory
  • Triggers torch.compile lazy compilation so the JIT cost does not inflate the first-batch timing measurement
  • Guard: the warmup only runs when trt_context is None and _applied_compile is set (the torch-only path with compile enabled)
  • CUDA sync after the warmup pass on GPU devices
  • Warmup logic extracted into _maybe_warmup_compile helper with 5 unit tests

Motivation

Without this, the first batch in the timing loop bears the full torch.compile lazy-compilation cost, skewing Phase Timing numbers — especially at low sample counts (PREDECODER_INFERENCE_NUM_SAMPLES=1):

| | Model forward (first batch) |
|---|---|
| Before | ~887 ms |
| After | ~1 ms |

With large sample counts the JIT cost gets amortised naturally, but at small counts it dominates and makes Phase Timing numbers misleading. Proposed by Igor Almeida Baratta; approved by Ben Howe.
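
To make the amortisation argument concrete, here is a small worked calculation. It assumes, purely for illustration, that only the first batch pays the ~887 ms cost, that every later batch takes ~1 ms, and that the reported figure is a mean over batches; the function name `mean_forward_ms` is hypothetical and not part of this PR.

```python
def mean_forward_ms(num_batches: int, first_ms: float = 887.0, steady_ms: float = 1.0) -> float:
    """Mean per-batch forward time when only the first batch pays the JIT cost."""
    if num_batches < 1:
        raise ValueError("num_batches must be at least 1")
    return (first_ms + (num_batches - 1) * steady_ms) / num_batches


for n in (1, 10, 1000):
    print(n, round(mean_forward_ms(n), 2))
# 1 887.0
# 10 89.6
# 1000 1.89
```

Under these assumptions the mean only approaches the steady-state figure at around a thousand batches, which is why a single-sample run is dominated by compilation.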

Test plan

  • Existing unit tests pass (test_inference_latency_timing.py, test_tensorrt_fallback.py)
  • Run with PREDECODER_INFERENCE_NUM_SAMPLES=1, confirm first-batch model-forward time matches steady-state
  • Run with TRT enabled, confirm warmup block is skipped
  • CI green
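
The second check in the plan above (first-batch time matching steady-state) can be illustrated without torch. The sketch below uses a stand-in class, `LazyCompiled`, that sleeps on its first call to imitate lazy compilation; it and `time_batches_ms` are hypothetical names, and the 50 ms cost is arbitrary rather than a measured value.

```python
import time
from typing import Callable, List


class LazyCompiled:
    """Stand-in for a compiled module: the first call pays a one-off cost."""

    def __init__(self, compile_cost_s: float = 0.05) -> None:
        self._compile_cost_s = compile_cost_s
        self._compiled = False

    def __call__(self, x: int) -> int:
        if not self._compiled:
            time.sleep(self._compile_cost_s)  # simulated JIT compilation
            self._compiled = True
        return x + 1


def time_batches_ms(module: Callable[[int], int], num_batches: int, warmup: bool) -> List[float]:
    """Time each call in milliseconds, optionally after one untimed warmup call."""
    if warmup:
        module(0)  # untimed pass absorbs the one-off cost
    timings = []
    for i in range(num_batches):
        start = time.perf_counter()
        module(i)
        timings.append((time.perf_counter() - start) * 1e3)
    return timings


cold = time_batches_ms(LazyCompiled(), num_batches=3, warmup=False)
warm = time_batches_ms(LazyCompiled(), num_batches=3, warmup=True)
print(f"no warmup:   first={cold[0]:.2f} ms, later={cold[1]:.4f} ms")
print(f"with warmup: first={warm[0]:.2f} ms, later={warm[1]:.4f} ms")
```

Without the warmup the first timed call carries the simulated cost; with it, the first timed call is in line with the later ones. A real check would time `pipeline_module` itself and synchronise the GPU before reading the clock.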

🤖 Generated with Claude Code

ivanbasov and others added 3 commits April 20, 2026 11:34
… JIT

Without this, the first batch in the timing loop bears the full
torch.compile lazy-compilation cost (~887 ms vs ~1 ms steady-state),
skewing Phase Timing numbers — especially at low sample counts like
PREDECODER_INFERENCE_NUM_SAMPLES=1.  The warmup only runs when
torch.compile is active and TRT is not in use.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extracts the warmup block into a named helper so it can be tested in
isolation.  Five tests cover: fires when compile is active (CPU), skipped
when compile is off, skipped when TRT context is present, CUDA sync called
on GPU device, CUDA sync not called on CPU device.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ivanbasov ivanbasov marked this pull request as ready for review April 20, 2026 18:37
@ivanbasov ivanbasov requested review from IgorBaratta, bmhowe23 and kvmto and removed request for kvmto April 20, 2026 18:37
Collaborator

@IgorBaratta IgorBaratta left a comment

LGTM

@ivanbasov ivanbasov merged commit c71b4e4 into main Apr 23, 2026
17 checks passed
@ivanbasov ivanbasov deleted the fix/timing-compile-warmup branch April 23, 2026 15:03
3 participants